Data Carpentry: Using R to Analyze and Visualize Data

Evan Cortens, PhD and Stephen Childs

October 21, 2018

Introductions

  • Who are you/where do you work?
  • What software/tools do you spend most time with? (Excel? SPSS? Tableau? etc)
  • Do you have any programming experience? (Python? C? Javascript?)
  • What do you hope to get out of this workshop? How do you see yourself using R in an IR environment?

Why use R?

  • Free
  • Open-source
  • A “statistical” programming language
    • Generally accepted by academic statisticians
    • Thoroughly grounded in statistical theory
  • A great community
    • Help and support
    • Innovation and development
  • R is a stable, well-supported language
    • Major corporate backing, e.g., Microsoft

Growing fast!

https://stackoverflow.blog/2017/10/10/impressive-growth-r/

Some history

  • Based on the S programming language
    • Dates from the late 1970s, substantially revised in the late 1980s
    • (there is still a commercial version, S-PLUS)
  • R was developed at the University of Auckland
    • Ross Ihaka and Robert Gentleman
    • First developed in 1993
    • Stable public release in 2000
  • RStudio
    • IDE, generally recognized as high quality, even outside R
    • Development began in 2008, first public release in February 2011
      • Version 1.0 in Fall 2016
    • There have been other IDEs for R in the past, but since RStudio came on the scene, most are no longer seriously maintained

Some history

Why is this relevant?

  • You’ll likely notice some quirks in R that differ from other programming languages you may have worked in; this unique history often explains why.
  • Explains R’s position as a “statistical programming language”:
    • Think stats/data first, programming second.

Today’s approach/Learning outcomes

In learning R today, we’ll take the approach used in R for Data Science (Wickham & Grolemund, 2016), diagrammed like this:

One of the strengths of R: every step of this workflow can be done using it!

Import

  • Loading the data!
  • For IR-type tasks, this will generally be from data files or directly from databases.
  • Other possibilities include web APIs, or even web scraping

Tidy

  • “Tidy data” (Wickham, 2014)
    • Each column is a variable, each row is an observation.
  • Doing this kind of work up front, right after loading, lets you focus on the modelling/analysis/visualization problem at hand, rather than having to rework your data at each stage of your analysis.

Transform/Visualize/Model

  • Repeat these three steps as necessary:

Transform

  • Together with tidying, sometimes called data wrangling
    • Filtering (selecting specific observations)
    • Mutating (creating new variables)
    • Summarizing (means, counts, etc)

Visualise

  • For both exploratory purposes and production/communication

Model

  • Complementary to visualisation
  • For the “precise exploration” of “specific questions”

Communicate

  • The final output, whether it’s just for yourself, your colleagues, or a wider audience
  • Increasingly, there’s a trend toward “reproducible research”, which integrates even the communication step (final paper, report, etc.) into the code/analysis.

Housekeeping

Installing R and RStudio

Basic R

  • An interpreted language (like Python)
  • Code can be run as:
    • scripts (programmatically/non-interactively)
    • from the prompt (interactively)
  • R uses a REPL (Read-Evaluate-Print Loop), just like Python
    • Using R as a calculator (demonstration)

Outline

  1. Visualisation
  2. Transformation
  3. Importing and ‘wrangling’
  4. Hands-on portion using what we’ve learned

Visualisation

Let’s load our package

A package only needs to be installed once (per major version, e.g. 3.3.x to 3.4.x), but must be loaded every time.
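In practice, that looks like this (the package here is the tidyverse, which we use throughout):

```r
# Install once per R installation (or after a major version upgrade):
# install.packages("tidyverse")

# Load at the start of every session:
library(tidyverse)
```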

The ‘mpg’ data set

Data on the fuel efficiency of 38 models of cars between 1999 and 2008 from the US EPA:

## # A tibble: 234 x 11
##    manufacturer model displ  year   cyl trans drv     cty   hwy fl    cla…
##    <chr>        <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <ch>
##  1 audi         a4      1.8  1999     4 auto… f        18    29 p     com…
##  2 audi         a4      1.8  1999     4 manu… f        21    29 p     com…
##  3 audi         a4      2    2008     4 manu… f        20    31 p     com…
##  4 audi         a4      2    2008     4 auto… f        21    30 p     com…
##  5 audi         a4      2.8  1999     6 auto… f        16    26 p     com…
##  6 audi         a4      2.8  1999     6 manu… f        18    26 p     com…
##  7 audi         a4      3.1  2008     6 auto… f        18    27 p     com…
##  8 audi         a4 q…   1.8  1999     4 manu… 4        18    26 p     com…
##  9 audi         a4 q…   1.8  1999     4 auto… 4        16    25 p     com…
## 10 audi         a4 q…   2    2008     4 manu… 4        20    28 p     com…
## # ... with 224 more rows

The Grammar of Graphics

  • Layers
  • Inheritance
  • Mapping (aes)

Our First Plot

  • A car’s highway mileage (mpg) vs its engine size (displacement in litres).
  • What might the relationship be?
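A minimal sketch of that first plot, using the mpg variable names:

```r
library(ggplot2)

# highway mileage (hwy) vs engine displacement (displ)
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))
```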

Our First Plot

  • As engine size increases, fuel efficiency decreases, roughly.

Exercises

  • The first scatterplot was highway mileage vs displacement: how can we make it city mileage vs displacement?
  • Make a scatterplot of hwy vs cyl.
  • What about a scatterplot of class vs drv?

Hwy vs cyl

Class vs drv

What’s going on here? Is this useful? How might we make it more useful?

Additional aesthetics

What about those outliers? Use the color aesthetic.
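A sketch, mapping the class variable to colour so we can see which cars the outliers are:

```r
library(ggplot2)

# each value of class gets its own colour, with a legend added automatically
ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, color = class))
```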

‘Unmapped’ aesthetics

What’s happening here?

What’s gone wrong?

What’s happened here? What colour are the points? Why?
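These two slides contrast *setting* an aesthetic versus *mapping* it; a sketch of both, following the usual ggplot2 example:

```r
library(ggplot2)

# Set outside aes(): every point is literally blue
ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy), color = "blue")

# Inside aes(), "blue" is treated as data: ggplot maps the constant
# string to a colour scale, so the points come out a default red-ish hue
ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, color = "blue"))
```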

Exercises

  • mpg variable types: which are categorical, which are continuous?
  • ?mpg
  • Map a continuous variable to colour, size, and shape: how are these aesthetics different for categorical vs continuous?
  • What happens if you map an aesthetic to something other than a variable name, like aes(color = displ < 5)?

Facets: One Variable

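Faceting uses R’s formula notation (the ~ operator). A sketch with the mpg data:

```r
library(ggplot2)

# one panel per value of class (one variable)
ggplot(mpg) +
  geom_point(aes(displ, hwy)) +
  facet_wrap(~ class, nrow = 2)

# a grid of panels: drv as rows, cyl as columns (two variables)
ggplot(mpg) +
  geom_point(aes(displ, hwy)) +
  facet_grid(drv ~ cyl)
```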

Facets: Two Variables

Other “geoms” (geometries)

Smooth

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
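The plot behind that message is, roughly:

```r
library(ggplot2)

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()  # defaults to a loess smoother for small data sets
```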

Smooth aesthetics

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Which aesthetics do geoms have?

?geom_smooth

Clearer?

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Reducing Duplication

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Exercise: Recreate:

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Exercise: Recreate:

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Exercise: Recreate:

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

One last visualization

A common scenario:

Coding basics

R as a calculator
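One plausible set of expressions matching the printed results below:

```r
1 / 200 * 30       # 0.15
(59 + 73 + 2) / 3  # 44.66667
sin(pi / 2)        # 1
```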

## [1] 0.15
## [1] 44.66667
## [1] 1

Assignments
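Assignment in R uses the <- operator:

```r
x <- 3 * 4  # create (or overwrite) the object x
x           # typing a bare name prints its value: 12
```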

Calling functions

Calling R functions, in general:

function_name(arg1 = val1, arg2 = val2, ...)

An example function, seq:

##  [1]  1  2  3  4  5  6  7  8  9 10
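The call producing that output, plus a variant with named arguments:

```r
seq(1, 10)                             # positional arguments: 1 through 10
seq(from = 2, by = 3, length.out = 4)  # named arguments: 2 5 8 11
```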

Data wrangling with dplyr and tidyr (tidyverse)

Create RStudio project

  1. File | New Project…
  2. New Directory
  3. Empty Project
  4. Under “Directory name:” put in “CIRPA 2018”, under “Create project as a subdirectory of:”, pick somewhere you’ll remember.
  5. Click “Create Project”, and RStudio will refresh, with your new project loaded.

Download the data

We’ll look at CANSIM Tables 0477-0058 and 0477-0059 which cover university revenues and expenditures, respectively.

TODO: Should we change the examples from CANSIM to a built in dataset?

To save time, I’ve already downloaded the data from CANSIM. You can get it here:

http://evancortens.com/cirpa/

Load the data

We’ll use read_csv() from readr, which I prefer to the base R function, as it has some more sensible defaults, and it’s faster.
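A sketch of the call (the file name here is illustrative; use whatever name you saved the download under):

```r
library(readr)

# file name is an assumption -- substitute your downloaded CSV
revenues <- read_csv("cansim-0477-0058.csv")
```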

## Parsed with column specification:
## cols(
##   Ref_Date = col_character(),
##   GEO = col_character(),
##   SCHOOL = col_character(),
##   REVENUE = col_character(),
##   FUND = col_character(),
##   Vector = col_character(),
##   Coordinate = col_character(),
##   Value = col_double()
## )

The read_csv() function tells us that it guessed the types of the various columns. In this situation, the default guesses are fine, but of course we can force it to treat columns in certain ways if we wish.

What does the data look like?

## # A tibble: 122,240 x 8
##    Ref_Date  GEO    SCHOOL      REVENUE  FUND     Vector Coordinate  Value
##    <chr>     <chr>  <chr>       <chr>    <chr>    <chr>  <chr>       <dbl>
##  1 2000/2001 Canada Total univ… Total r… Total f… v8070… 1.1.1.1    1.62e7
##  2 2001/2002 Canada Total univ… Total r… Total f… v8070… 1.1.1.1    1.72e7
##  3 2002/2003 Canada Total univ… Total r… Total f… v8070… 1.1.1.1    1.85e7
##  4 2003/2004 Canada Total univ… Total r… Total f… v8070… 1.1.1.1    2.16e7
##  5 2004/2005 Canada Total univ… Total r… Total f… v8070… 1.1.1.1    2.27e7
##  6 2005/2006 Canada Total univ… Total r… Total f… v8070… 1.1.1.1    2.47e7
##  7 2006/2007 Canada Total univ… Total r… Total f… v8070… 1.1.1.1    2.65e7
##  8 2007/2008 Canada Total univ… Total r… Total f… v8070… 1.1.1.1    2.72e7
##  9 2008/2009 Canada Total univ… Total r… Total f… v8070… 1.1.1.1    2.64e7
## 10 2009/2010 Canada Total univ… Total r… Total f… v8070… 1.1.1.1    3.19e7
## # ... with 122,230 more rows

We have 8 columns and 122,240 rows of data. read_csv() brings the data in as a tibble, which is just an R “data frame”, but with some handy defaults, some of which we’re seeing here. For instance, it gives us the size of the data frame in rows and columns, the types of the columns (e.g., “<chr>” for character) and only prints the first 10 rows, instead of overwhelming us with all of the data.

What does the data look like?

Or click the icon in the Environment tab.

What does the data look like?

  • Ref_Date: fiscal year
  • GEO: province or whole country
  • SCHOOL: Direct participation in CAUBO survey, or via Statistics Canada, or combined (we’ll just look at this)
  • REVENUE: Type of income (e.g., tuition, provincial government, SSHRC, etc)
  • FUND: Classification of income “in accordance with activities or objectives as specified by donors, … regulations, restrictions, or limitations” (We’re just going to look at “Total funds (x 1,000)”, but the distinctions are important for full analysis)
  • Vector: a CANSIM unique identifier
  • Coordinate: ditto
  • Value: the actual dollar value (x 1,000, per the FUND heading)

What makes data “tidy”?

  1. Each variable must have its own column.
  2. Each observation must have its own row.
  3. Each value must have its own cell.

(See Wickham, 2014 and R for Data Science, chapter 12)

Is the CANSIM data tidy?

  • Nope! So let’s tidy it.

dplyr basics

  • Pick observations by their values (filter()).
  • Reorder the rows (arrange()).
  • Pick variables by their names (select()).
  • Create new variables with functions of existing variables (mutate()).
  • Collapse many values down to a single summary (summarise()).

  • All can be used in conjunction with group_by()

dplyr cheat sheet

https://www.rstudio.com/resources/cheatsheets/

(or Help | Cheatsheets in RStudio!)

filter() the rows we want

## # A tibble: 5,632 x 8
##    Ref_Date  GEO    SCHOOL      REVENUE  FUND     Vector Coordinate  Value
##    <chr>     <chr>  <chr>       <chr>    <chr>    <chr>  <chr>       <dbl>
##  1 2000/2001 Canada Total univ… Total r… Total f… v8070… 1.1.1.1    1.62e7
##  2 2001/2002 Canada Total univ… Total r… Total f… v8070… 1.1.1.1    1.72e7
##  3 2002/2003 Canada Total univ… Total r… Total f… v8070… 1.1.1.1    1.85e7
##  4 2003/2004 Canada Total univ… Total r… Total f… v8070… 1.1.1.1    2.16e7
##  5 2004/2005 Canada Total univ… Total r… Total f… v8070… 1.1.1.1    2.27e7
##  6 2005/2006 Canada Total univ… Total r… Total f… v8070… 1.1.1.1    2.47e7
##  7 2006/2007 Canada Total univ… Total r… Total f… v8070… 1.1.1.1    2.65e7
##  8 2007/2008 Canada Total univ… Total r… Total f… v8070… 1.1.1.1    2.72e7
##  9 2008/2009 Canada Total univ… Total r… Total f… v8070… 1.1.1.1    2.64e7
## 10 2009/2010 Canada Total univ… Total r… Total f… v8070… 1.1.1.1    3.19e7
## # ... with 5,622 more rows

filter() help tells us that everything after the first argument (i.e., the “…”) are: “Logical predicates defined in terms of the variables in .data. Multiple conditions are combined with &.”
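A self-contained sketch of the same idea on a toy tibble (the category labels are illustrative, not the exact CANSIM values):

```r
library(dplyr)

cansim <- tibble(
  GEO  = c("Canada", "Ontario", "Quebec"),
  FUND = c("Total funds", "Total funds", "Operating")
)

# multiple conditions are combined with & (AND)
filter(cansim, GEO == "Canada", FUND == "Total funds")
```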

Comparison operators

Logical operators

Other operators we can use with filter():

Complete set of boolean operations. x is the left-hand circle, y is the right-hand circle, and the shaded regions show which parts each operator selects.
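A few of these operators at the prompt:

```r
x <- c(1, 5, 10)

x > 3 & x < 8     # AND:        FALSE  TRUE FALSE
x == 1 | x == 10  # OR:         TRUE  FALSE  TRUE
!(x > 3)          # NOT:        TRUE  FALSE FALSE
x %in% c(1, 10)   # membership: TRUE  FALSE  TRUE
```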

filter()

In other words, the previous command is equivalent to this:

## # A tibble: 5,632 x 8
##    Ref_Date  GEO    SCHOOL      REVENUE  FUND     Vector Coordinate  Value
##    <chr>     <chr>  <chr>       <chr>    <chr>    <chr>  <chr>       <dbl>
##  1 2000/2001 Canada Total univ… Total r… Total f… v8070… 1.1.1.1    1.62e7
##  2 2001/2002 Canada Total univ… Total r… Total f… v8070… 1.1.1.1    1.72e7
##  3 2002/2003 Canada Total univ… Total r… Total f… v8070… 1.1.1.1    1.85e7
##  4 2003/2004 Canada Total univ… Total r… Total f… v8070… 1.1.1.1    2.16e7
##  5 2004/2005 Canada Total univ… Total r… Total f… v8070… 1.1.1.1    2.27e7
##  6 2005/2006 Canada Total univ… Total r… Total f… v8070… 1.1.1.1    2.47e7
##  7 2006/2007 Canada Total univ… Total r… Total f… v8070… 1.1.1.1    2.65e7
##  8 2007/2008 Canada Total univ… Total r… Total f… v8070… 1.1.1.1    2.72e7
##  9 2008/2009 Canada Total univ… Total r… Total f… v8070… 1.1.1.1    2.64e7
## 10 2009/2010 Canada Total univ… Total r… Total f… v8070… 1.1.1.1    3.19e7
## # ... with 5,622 more rows

I prefer this notation, as it’s more explicit.

filter()

But, one more thing: we need to assign the return value of the filter() function back to a variable:

Shortcut for assignment operator: Alt-Hyphen in RStudio

Combining operations

A new variable for each step gets cumbersome, so dplyr provides an operator, the pipe (%>%), that combines operations:

x %>% f(y) turns into f(x, y), and x %>% f(y) %>% g(z) turns into g(f(x, y), z) etc.

Shortcut for pipe operator: Ctrl-Shift-M in RStudio
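A tiny illustration with base functions:

```r
library(dplyr)  # provides %>% (originally from magrittr)

x <- c(1, 4, 9)

sqrt(sum(x))            # nested: read inside-out
x %>% sum() %>% sqrt()  # piped: read left-to-right; same result
```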

## # A tibble: 1 x 4
##   Ref_Date GEO    REVENUE           Value
##      <int> <chr>  <chr>             <dbl>
## 1     2000 Canada Total revenues 16224715

Select helper functions

  • There are a number of helper functions you can use within select():
    • starts_with("abc"): matches names that begin with “abc”.
    • ends_with("xyz"): matches names that end with “xyz”.
    • contains("ijk"): matches names that contain “ijk”.
    • matches("(.)\\1"): selects variables that match a regular expression.
    • num_range("x", 1:3): matches x1, x2 and x3.
    • one_of(vector): columns whose names are in said vector.
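A few of these helpers on a toy tibble:

```r
library(dplyr)

df <- tibble(x1 = 1, x2 = 2, year = 2018, value = 10)

select(df, starts_with("x"))           # x1, x2
select(df, num_range("x", 1:2))        # x1, x2
select(df, one_of(c("year", "value"))) # year, value
```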

Tidy data with tidyr

The two main tidyr functions:

  • gather(data, key, value): Moves column names into a key column, with the values going into a single value column.
  • spread(data, key, value): Moves unique values of key column into column names, with values from the value column.
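A round trip on a toy table (column names are illustrative):

```r
library(tidyr)
library(tibble)

wide <- tibble(
  GEO    = c("Canada", "Ontario"),
  `2000` = c(16.2, 6.2),
  `2001` = c(17.2, 6.6)
)

# year columns become a key column; values go into one revenue column
long <- gather(wide, key = "year", value = "revenue", `2000`, `2001`)
long

spread(long, key = "year", value = "revenue")  # back to the wide layout
```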

tidyr Cheatsheet

https://www.rstudio.com/resources/cheatsheets/ (Data Import Cheatsheet)

spread() CANSIM

## # A tibble: 176 x 34
##    Ref_Date GEO   `Canada Foundat… `Canada Researc… `Canadian Insti…
##       <int> <chr>            <dbl>            <dbl>            <dbl>
##  1     2000 Albe…             8208             1750            42174
##  2     2000 Brit…             5189             1525            26960
##  3     2000 Cana…           212418            19802           329025
##  4     2000 Mani…             7170              550            10780
##  5     2000 New …             1745              300               96
##  6     2000 Newf…             2799                0             2771
##  7     2000 Nova…             3234                0             7521
##  8     2000 Onta…            91066             8533           117466
##  9     2000 Prin…              355                0              145
## 10     2000 Queb…            75988             7144           117205
## # ... with 166 more rows, and 29 more variables: `Credit courses
## #   tuiton` <dbl>, `Donations made by business enterprises` <dbl>,
## #   `Donations made by individuals` <dbl>, `Donations made by
## #   not-for-profit organizations` <dbl>, Endowment <dbl>, Federal <dbl>,
## #   Foreign <dbl>, `Grants made by business enterprises` <dbl>, `Grants
## #   made by Individuals` <dbl>, `Grants made by not-for-profit
## #   organizations` <dbl>, `Health Canada` <dbl>, Investments <dbl>,
## #   Miscellaneous <dbl>, Municipal <dbl>, `Natural Sciences and
## #   Engineering Research Council` <dbl>, `Non-credit tution` <dbl>,
## #   `Non-federal` <dbl>, `Other federal` <dbl>, `Other fees` <dbl>, `Other
## #   investments` <dbl>, `Other provinces` <dbl>, `Other revenue
## #   type` <dbl>, Provincial <dbl>, `Sale of service and products` <dbl>,
## #   `Social Sciences and Humanities Research Council` <dbl>, `Total
## #   donations` <dbl>, `Total grants` <dbl>, `Total revenues` <dbl>,
## #   `Tuition and other fees` <dbl>

Visualizing

Not the prettiest, but it works!

Visualizing: Labels

Inline Calculation

Exercises

  1. Visualize another variable
  2. Set some appropriate labels and scales
  3. Graph a time subset (e.g., between 2005 and 2010)

Other dplyr “verbs”

Which province took in the most tuition revenue in 2010?

## # A tibble: 10 x 2
##    GEO                       `Tuition and other fees`
##    <chr>                                        <dbl>
##  1 Ontario                                    3563011
##  2 British Columbia                           1009476
##  3 Quebec                                      825943
##  4 Alberta                                     711973
##  5 Nova Scotia                                 305597
##  6 Manitoba                                    176165
##  7 Saskatchewan                                173600
##  8 New Brunswick                               146213
##  9 Newfoundland and Labrador                    61537
## 10 Prince Edward Island                         27878

Create a new variable

## # A tibble: 176 x 5
##    Ref_Date GEO          `Tuition and othe… `Total revenues` tuition_share
##       <int> <chr>                     <dbl>            <dbl>         <dbl>
##  1     2000 Alberta                  298464          1716783         0.174
##  2     2000 British Col…             321017          2020474         0.159
##  3     2000 Canada                  3052960         16224715         0.188
##  4     2000 Manitoba                  99932           629163         0.159
##  5     2000 New Brunswi…              79153           356899         0.222
##  6     2000 Newfoundlan…              49954           268822         0.186
##  7     2000 Nova Scotia              173980           685133         0.254
##  8     2000 Ontario                 1509442          6192454         0.244
##  9     2000 Prince Edwa…              13778            63238         0.218
## 10     2000 Quebec                   407458          3613399         0.113
## # ... with 166 more rows

Keeps the same number of rows
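The tuition_share column above comes from mutate(); a toy sketch of the pattern (values taken from the 2000 rows of the output):

```r
library(dplyr)

rev <- tibble(
  GEO     = c("Alberta", "Ontario"),
  tuition = c(298464, 1509442),
  total   = c(1716783, 6192454)
)

# mutate() adds a column computed from existing ones; row count unchanged
mutate(rev, tuition_share = tuition / total)
```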

Useful “mutations”

  • Arithmetic: +, -, *, /, ^
  • Modular arithmetic: %/% (integer division), %% (remainder)
  • Logs: log(), log2(), log10()
  • Offsets: lead(), lag()
  • Cumulative and rolling aggregates: cumsum(), cumprod(), cummin(), cummax(), cummean()
  • Logical comparisons: <, <=, >, >=, !=, ==
  • Ranking: min_rank(), row_number(), dense_rank(), percent_rank(), cume_dist(), ntile()
  • ifelse()
  • recode()
  • case_when()
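The last three are for recoding; a quick sketch of ifelse() and case_when():

```r
library(dplyr)

x <- c(-2, 0, 5)

ifelse(x > 0, "positive", "not positive")

case_when(
  x < 0  ~ "negative",
  x == 0 ~ "zero",
  TRUE   ~ "positive"   # catch-all for everything else
)
```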

Summarize (aggregate)

## # A tibble: 11 x 2
##    GEO                       avg_endowment_revenue
##    <chr>                                     <dbl>
##  1 Canada                                  526071.
##  2 Ontario                                 211475.
##  3 Quebec                                   87323.
##  4 Alberta                                  80069 
##  5 British Columbia                         75752.
##  6 Nova Scotia                              31148.
##  7 Saskatchewan                             21256.
##  8 New Brunswick                            13778.
##  9 Newfoundland and Labrador                 2878.
## 10 Manitoba                                  1646.
## 11 Prince Edward Island                       747.

Useful summarising functions:

  • mean(x), median(x)
  • sd(x), IQR(x) (interquartile range), mad(x) (median absolute deviation)
  • min(x), max(x), quantile(x, 0.25)
  • first(x), nth(x, 2), last(x)
  • n() (the group size), n_distinct(x)
  • Counts and proportions of logical values: sum(x > 10), mean(y == 0)
    • TRUE is converted to 1 and FALSE to 0
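summarise() is most useful together with group_by(); a toy sketch of the pattern behind the table above:

```r
library(dplyr)

df <- tibble(
  GEO   = c("A", "A", "B"),
  endow = c(10, 20, 5)
)

df %>%
  group_by(GEO) %>%
  summarise(avg_endowment_revenue = mean(endow), n_years = n())
```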

Exercise:

Compute the total tuition revenue of the three western-most provinces in 2007.

## # A tibble: 1 x 1
##   `sum(\`Tuition and other fees\`)`
##                               <dbl>
## 1                           1374509

Group by new variable

These three provinces compared to all other provinces and national average.

## # A tibble: 3 x 2
##   category            `sum(\`Tuition and other fees\`)`
##   <chr>                                           <dbl>
## 1 3 Western Provinces                           1374509
## 2 All Other Provinces                           4079866
## 3 Canada                                        5454375

Change relative first year

Logical sums

Count years with positive endowment income by province

## # A tibble: 11 x 2
##    GEO                       num_pos_endow
##    <chr>                             <int>
##  1 Alberta                              13
##  2 British Columbia                     13
##  3 Canada                               13
##  4 Manitoba                             14
##  5 New Brunswick                        13
##  6 Newfoundland and Labrador            13
##  7 Nova Scotia                          14
##  8 Ontario                              12
##  9 Prince Edward Island                 13
## 10 Quebec                               16
## 11 Saskatchewan                         12

Exercises

  • Compare the Maritimes to the rest of Canada on one or more variable
  • Between 2005 and 2007, which provinces received more than $100 M in Total grants?
  • Which provinces, other than Nova Scotia and Ontario, had more than $20 M in Endowment income in 2003?
  • Which province was in “third place” in total revenue in 2012?

purrr (quickly)


## # A tibble: 16 x 5
##    Ref_Date revenue_quantiles  low_25      mid  high_75
##       <int> <list>              <dbl>    <dbl>    <dbl>
##  1     2000 <dbl [3]>         424965   681742. 1944551.
##  2     2001 <dbl [3]>         417664.  700424  2126550.
##  3     2002 <dbl [3]>         465112.  729434  2378510.
##  4     2003 <dbl [3]>         480404.  804636. 2708008.
##  5     2004 <dbl [3]>         530995.  853132. 2818174.
##  6     2005 <dbl [3]>         560023.  911819  3257599.
##  7     2006 <dbl [3]>         616412.  964536. 3320636.
##  8     2007 <dbl [3]>         592927.  944042. 3560364.
##  9     2008 <dbl [3]>         568461.  968932. 3404852.
## 10     2009 <dbl [3]>         741016. 1161066  4363439.
## 11     2010 <dbl [3]>         787082. 1213456  4436956.
## 12     2011 <dbl [3]>         775059. 1169906. 4177081 
## 13     2012 <dbl [3]>         799797. 1240542  4296184.
## 14     2013 <dbl [3]>         791843. 1298845  4437751.
## 15     2014 <dbl [3]>         799297. 1316785  4623504.
## 16     2015 <dbl [3]>         770571. 1261936. 4451983.

Vectors (and other types)

Missing data

## [1] NA
## [1] NA
## [1] NA

What about this?

NA / 2

Filtering and NAs

## [1] TRUE

filter() only includes rows where the condition is TRUE; it excludes both FALSE and NA values. If you want to preserve missing values, ask for them explicitly:
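The two results below can be reproduced like this:

```r
library(dplyr)

df <- tibble(x = c(1, NA, 3))

filter(df, x > 1)             # NA comparisons are dropped
filter(df, is.na(x) | x > 1)  # keep the missing values explicitly
```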

## # A tibble: 1 x 1
##       x
##   <dbl>
## 1     3
## # A tibble: 2 x 1
##       x
##   <dbl>
## 1    NA
## 2     3

Dealing with vectors…

## [1] 1 2 3
## [1] 1 2 3

Dealing with vectors…

## [1]  2  4  6  5  7  9  8 10 12

What about…

1:3 + 1:10

Dealing with vectors…

## Warning in 1:3 + 1:10: longer object length is not a multiple of shorter
## object length
##  [1]  2  4  6  5  7  9  8 10 12 11

What happened?

Missing values

## [1] 2

What do we expect here?

mean(c(1,NA,3))

Missing values

## [1] NA
## [1] 2
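The two results come from the na.rm argument:

```r
mean(c(1, NA, 3))                # NA: missing values propagate
mean(c(1, NA, 3), na.rm = TRUE)  # 2: drop the NAs first
```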

Vectors

Vectors

  • Coercion (explicit and implicit)
    • Type hierarchy
  • Atomic vectors (homogeneous) vs lists (can nest/heterogeneous)
  • Names
  • Subsetting
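Quick illustrations of naming and subsetting:

```r
v <- c(x = 1, y = 2, z = 4)  # names assigned at creation
v[c("x", "z")]               # subset by name

w <- c("one", "two", "three", "four", "five")
w[c(3, 2, 5)]                # subset by position: "three" "two" "five"

n <- c(10, 3, 5, 8, 1)
n[n > 4]                     # logical subsetting: 10 5 8
```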

Vectors: Naming

## x y z 
## 1 2 4
## a b c 
## 1 2 3

Vectors: Subsetting

## [1] "three" "two"   "five"
## [1] "two"  "four"
## xyz def 
##   5   2

Vectors: Logical subsetting

## [1] 10  3  5  8  1
## [1] 10 NA  8 NA

Lists
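The output below comes from printing a list and then inspecting it with str():

```r
l <- list(1, 2, 3)
l        # printed with [[i]] indices, one element at a time
str(l)   # a compact structural summary
```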

## [[1]]
## [1] 1
## 
## [[2]]
## [1] 2
## 
## [[3]]
## [1] 3
## List of 3
##  $ : num 1
##  $ : num 2
##  $ : num 3

Lists: Can contain different types of objects:

## List of 4
##  $ : chr "a"
##  $ : int 1
##  $ : num 1.5
##  $ : logi TRUE

Lists: Or other lists:

## List of 2
##  $ :List of 2
##   ..$ : num 1
##   ..$ : num 2
##  $ :List of 2
##   ..$ : num 3
##   ..$ : num 4

Modeling

Types of Modeling

  • define a family of models, expressed as an equation, whose parameters you are trying to estimate
  • generate a fitted model that fills in those parameters

The goal

You are trying to get the model from the family that best fits the data. That doesn’t mean it’s a “good” or “true” model.

Simulated Data

The modelr package has a simulated dataset called sim1.

##  [1]  1  1  1  2  2  2  3  3  3  4  4  4  5  5  5  6  6  6  7  7  7  8  8
## [24]  8  9  9  9 10 10 10

Plotting sim1

Plotting sim1

  • we can see that there is a pattern from the scatter plot
  • the pattern looks like a straight line; geom_abline plots a straight line on the graph

Using geom_abline

Using geom_abline

That’s just one model - how can we tell if it’s good or not?

Evaluating Predictive Models

We look at the difference between the predictions and the actual data.

Defining a model as an R function

We can find this for any model by creating an R function. We take the intercept and add the slope multiplied by x.
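A sketch in the style of R for Data Science, where a is a vector of (intercept, slope):

```r
model1 <- function(a, data) {
  a[1] + data$x * a[2]   # intercept + slope * x
}

# e.g. intercept 12, slope 0.5 over x = 1..10
model1(c(12, 0.5), data.frame(x = 1:10))
```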

##  [1] 12.5 12.5 12.5 13.0 13.0 13.0 13.5 13.5 13.5 14.0 14.0 14.0 14.5 14.5
## [15] 14.5 15.0 15.0 15.0 15.5 15.5 15.5 16.0 16.0 16.0 16.5 16.5 16.5 17.0
## [29] 17.0 17.0

Difference between predicted and actual

Once we have the model’s prediction, we can look at the difference between those and what the actual data is.

##  [1]  -8.3000870  -4.9893659 -10.3745272  -4.0111426  -2.7568946
##  [6]  -1.7031768  -6.1436353  -2.9946506  -2.9883992  -1.5654109
## [11]  -2.1073988   0.2579641   4.6300498  -2.7619788   1.5248539
## [16]  -1.7260230   0.9559750   1.8947962   4.5859927   1.6718503
## [21]   4.4363088   5.7259025   2.3909129   6.4755526  10.2770099
## [26]   6.3051098   4.6283053   7.9680994   6.3464221   4.9752007

Using squared distances

We want all the distances to be positive numbers, and to penalize predictions that are far from the actual values more heavily, so we square the difference.

##  [1]  68.89144476  24.89377199 107.63081509  16.08926474   7.60046760
##  [6]   2.90081117  37.74425497   8.96793224   8.93052986   2.45051128
## [11]   4.44112956   0.06654547  21.43736106   7.62852691   2.32517942
## [16]   2.97915534   0.91388814   3.59025256  21.03132891   2.79508358
## [21]  19.68083613  32.78595957   5.71646454  41.93278203 105.61693163
## [26]  39.75440948  21.42120989  63.49060769  40.27707338  24.75262198

Summary Measure of Distance

We want a single summary measure, so we take the mean of the squared differences across all the predictions, then take the square root to put the result back in the original units.
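Putting the last few slides together as a function, a root-mean-squared distance:

```r
measure_distance <- function(mod, data) {
  diff <- data$y - (mod[1] + mod[2] * data$x)  # actual minus predicted
  sqrt(mean(diff^2))                           # root-mean-squared distance
}

# a perfectly fitting line has distance 0
d <- data.frame(x = c(1, 2, 3), y = c(3, 5, 7))
measure_distance(c(1, 2), d)  # the line y = 1 + 2x fits exactly: 0
```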

## [1] 4.995789

Use this measure to choose a model

So, to choose between models – we can just put in different values.

## [1] 2.665212

Random Models Compared

So, we can generate a lot of models and pick the one with the lowest error.

## # A tibble: 250 x 3
##        a1     a2  dist
##     <dbl>  <dbl> <dbl>
##  1  11.1  -0.485 10.4 
##  2 -14.9  -0.187 32.2 
##  3  27.1   1.70  21.1 
##  4 -15.9  -3.23  51.5 
##  5   5.41  0.851  6.76
##  6 -17.7   3.70  13.8 
##  7  15.1  -2.09  17.0 
##  8   5.10 -4.94  42.6 
##  9  11.4  -3.74  29.8 
## 10  -9.69 -3.77  48.9 
## # ... with 240 more rows

Plot the Random Models

The 10 Best Models

## # A tibble: 10 x 3
##          a1    a2  dist
##       <dbl> <dbl> <dbl>
##  1   1.39    2.31  2.66
##  2   3.57    1.88  2.72
##  3   7.42    1.34  3.05
##  4  -0.0249  2.50  3.06
##  5   8.54    1.62  3.14
##  6  11.0     1.33  4.06
##  7  -2.00    2.53  4.40
##  8  -3.00    2.63  4.87
##  9  -3.52    3.56  4.87
## 10  -5.34    3.26  5.00

Plot all the random models.

Plot those models on the data. Lighter colored models are closer.

Parameter Space

Now, we switch to parameter space… We plot every candidate model - showing the distance in color. We highlight our 10 best models in red.

Zoom In

So, you can zoom in on the area where the best models tend to be. Note the axes and the legend.

The new best 10 models

What do we get as the new best 10 models?

##          a1       a2     dist
## 1  4.375000 2.000000 2.137234
## 2  4.375000 2.083333 2.155409
## 3  3.333333 2.166667 2.168677
## 4  5.416667 1.916667 2.210294
## 5  3.333333 2.250000 2.212637
## 6  5.416667 1.833333 2.218550
## 7  4.375000 1.916667 2.241533
## 8  3.333333 2.083333 2.246169
## 9  4.375000 2.166667 2.293149
## 10 2.291667 2.333333 2.308274
##          a1       a2     dist
## 1  4.183673 2.061224 2.128424
## 2  4.693878 1.979592 2.139589
## 3  4.183673 2.020408 2.140222
## 4  3.673469 2.142857 2.144759
## 5  4.183673 2.102041 2.146651
## 6  3.673469 2.102041 2.150084
## 7  4.693878 2.020408 2.151341
## 8  4.693878 1.938776 2.157704
## 9  3.673469 2.183673 2.169193
## 10 5.204082 1.897959 2.177829

Plot the new best 10 models

Contour Map

A finer contour map

Overlay those 10 best models on the data.

Optimize the models

There is a tool to optimize the parameters so you get the smallest distance.
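That tool is base R’s optim(), which numerically minimizes a function of the parameters; a self-contained sketch on exact (noise-free) data:

```r
measure_distance <- function(mod, data) {
  sqrt(mean((data$y - (mod[1] + mod[2] * data$x))^2))
}

d <- data.frame(x = 1:10, y = 4 + 2 * (1:10))   # a line with no noise

# start from (0, 0) and let optim() search for the minimum-distance line
best <- optim(c(0, 0), measure_distance, data = d)
best$par   # approximately c(4, 2)
```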

## [1] 4.222248 2.051204

R can do all of this in 1 command

Of course, R has a tool to do this whole process automatically. This is our standard OLS regression.
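On the sim1 data this is lm(y ~ x, data = sim1), which yields the coefficients shown below; here is a self-contained version on exact data:

```r
d <- data.frame(x = 1:10, y = 4 + 2 * (1:10))

fit <- lm(y ~ x, data = d)
coef(fit)   # (Intercept) = 4, x = 2
```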

## (Intercept)           x 
##    4.220822    2.051533

Generate Predictions

We can use a model object like sim1_mod to generate predictions. We can start with an empty grid.
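The grid-and-predict pattern uses the modelr package (assuming sim1 and a fitted sim1_mod as on the previous slides):

```r
library(modelr)

sim1_mod <- lm(y ~ x, data = sim1)

grid <- data_grid(sim1, x)       # one row per unique value of x
add_predictions(grid, sim1_mod)  # adds a pred column of fitted values
```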

## # A tibble: 10 x 1
##        x
##    <int>
##  1     1
##  2     2
##  3     3
##  4     4
##  5     5
##  6     6
##  7     7
##  8     8
##  9     9
## 10    10
## # A tibble: 10 x 2
##        x  pred
##    <int> <dbl>
##  1     1  6.27
##  2     2  8.32
##  3     3 10.4 
##  4     4 12.4 
##  5     5 14.5 
##  6     6 16.5 
##  7     7 18.6 
##  8     8 20.6 
##  9     9 22.7 
## 10    10 24.7

Plot those predictions using geom_line.

Add residuals

Add residuals. We need to use the original dataset.
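With modelr, residuals attach to the original data, not the grid:

```r
library(modelr)

sim1_mod <- lm(y ~ x, data = sim1)
add_residuals(sim1, sim1_mod)  # adds a resid column; one row per observation
```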

## # A tibble: 30 x 3
##        x     y    resid
##    <int> <dbl>    <dbl>
##  1     1  4.20 -2.07   
##  2     1  7.51  1.24   
##  3     1  2.13 -4.15   
##  4     2  8.99  0.665  
##  5     2 10.2   1.92   
##  6     2 11.3   2.97   
##  7     3  7.36 -3.02   
##  8     3 10.5   0.130  
##  9     3 10.5   0.136  
## 10     4 12.4   0.00763
## # ... with 20 more rows

Capstone Exercise

Try these

Things to try:

  1. Find some outliers and why they happened (try googling specific dates).
  2. Estimate the effect of the seasons on bike share usage. What about day of the week?
  3. Build a model that predicts bike share usage for a given day. Separate your data into three parts.

Notes


  • The R formula language, using concepts from how the tidyverse handles fields.
    • data = .
    • Going from equations to R formulas: the ~ plays the role of the = we would write in an equation; y ~ x translates to y = a_1 + a_2 * x.
    • The model_matrix() function
    • Categorical variables (which means dealing with factors)
  • Statistical testing with t.test(), which uses the formula language
  • Modeling with lm(), which also uses the formula language

Additional Resources

Online Resources

Start here:

Additional:

For help/community:

References